Team members and contributions

Sandip Sonawane

Anurag Anand


EDA

Users File Description

================================================================================

User information is in the file "users.dat" and is in the following format:

UserID::Gender::Age::Occupation::Zip-code

All demographic information is provided voluntarily by the users and is not checked for accuracy. Only users who have provided some demographic information are included in this data set.

Movies File Description

================================================================================

Movie information is in the file "movies.dat" and is in the following format:

MovieID::Title::Genres

Ratings File Description

================================================================================

All ratings are contained in the file "ratings.dat" and are in the following format:

UserID::MovieID::Rating::Timestamp
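The three formats above share the same "::" delimiter, so one small parser can cover them. The following is a minimal sketch using only the standard library; the helper names (`parse_line`, `read_dat`), the `latin-1` default encoding, and the sample record are assumptions made for illustration, not part of the original notebook.

```python
from typing import Dict, List

# Field names copied from the format lines above.
USER_FIELDS = ["UserID", "Gender", "Age", "Occupation", "Zip-code"]
MOVIE_FIELDS = ["MovieID", "Title", "Genres"]
RATING_FIELDS = ["UserID", "MovieID", "Rating", "Timestamp"]


def parse_line(line: str, fields: List[str]) -> Dict[str, str]:
    """Split one '::'-delimited record into a dict keyed by field name."""
    values = line.rstrip("\n").split("::")
    if len(values) != len(fields):
        raise ValueError(
            f"expected {len(fields)} fields, got {len(values)}: {line!r}"
        )
    return dict(zip(fields, values))


def read_dat(path: str, fields: List[str],
             encoding: str = "latin-1") -> List[Dict[str, str]]:
    """Read a whole .dat file (the encoding is an assumption; adjust if needed)."""
    with open(path, encoding=encoding) as handle:
        return [parse_line(line, fields) for line in handle if line.strip()]


# Made-up record, for illustration only.
sample = parse_line("12::34::5::1000000000\n", RATING_FIELDS)
print(sample)
```

All values are kept as strings here; converting `Rating` to a number is left to the caller.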

Normalize the ratings by user ratings

The table above shows how rating_normalized accounts for a user's tendency to always give high ratings or always give low ratings. Compare rows 3 and 5 specifically: the average rating for movie 3 is actually high, but after normalization it is lower than that of movie 5.
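One common way to obtain such a column is to subtract each user's mean rating from that user's ratings. The sketch below assumes this mean-centering definition of rating_normalized, which the text does not spell out; the function names and the toy data are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rating = Tuple[int, int, float]  # (user_id, movie_id, rating)


def mean_center_by_user(ratings: List[Rating]) -> List[Tuple[int, int, float, float]]:
    """Append rating_normalized = rating - (mean rating of that user)."""
    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for user, _movie, rating in ratings:
        totals[user] += rating
        counts[user] += 1
    return [(u, m, r, r - totals[u] / counts[u]) for u, m, r in ratings]


def movie_means(rows, value_index: int) -> Dict[int, float]:
    """Average the column at value_index per movie (movie id is column 1)."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for row in rows:
        sums[row[1]] += row[value_index]
        counts[row[1]] += 1
    return {movie: sums[movie] / counts[movie] for movie in sums}


# Toy data: user 1 rates generously, user 2 rates harshly.
toy = [(1, 10, 4.0), (1, 30, 5.0), (2, 20, 3.0), (2, 30, 1.0)]
normalized = mean_center_by_user(toy)
raw_avg = movie_means(normalized, 2)       # averages of the raw ratings
centered_avg = movie_means(normalized, 3)  # averages of rating_normalized
print(raw_avg[10], raw_avg[20], centered_avg[10], centered_avg[20])
```

In the toy data, movie 10 has the higher raw average but the lower normalized average, because its only rating comes from the generous user.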

Ratings per Movie

Ratings per User

There are 177 movies that have not been rated by any user.
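A count like this can be obtained by comparing the movie IDs in movies.dat with those that appear in ratings.dat. The following sketch shows one way to do that, together with the per-movie and per-user rating counts; the function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Rating = Tuple[int, int, float]  # (user_id, movie_id, rating)


def rating_counts(ratings: Iterable[Rating]) -> Tuple[Counter, Counter]:
    """Return (ratings per movie, ratings per user)."""
    rows = list(ratings)
    per_movie = Counter(movie for _user, movie, _rating in rows)
    per_user = Counter(user for user, _movie, _rating in rows)
    return per_movie, per_user


def unrated_movies(movie_ids: Iterable[int], ratings: Iterable[Rating]) -> List[int]:
    """Movie IDs that are listed but never appear in the ratings."""
    rated = {movie for _user, movie, _rating in ratings}
    return sorted(set(movie_ids) - rated)


# Toy data, for illustration only.
toy_movies = [1, 2, 3, 4]
toy_ratings = [(1, 1, 5.0), (1, 2, 3.0), (2, 1, 4.0)]
per_movie, per_user = rating_counts(toy_ratings)
missing = unrated_movies(toy_movies, toy_ratings)
print(per_movie, per_user, missing)
```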

Distribution of Genres

Most of the movies are Drama or Comedy.

Film-Noir has the highest average rating, but there are far fewer such movies than in other genres. Documentary has the next-highest average rating.
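Because an average over few movies can be misleading, it helps to report the movie count next to the average rating for each genre. The sketch below assumes that the Genres field lists several genres separated by "|" and that the genre average is taken over all ratings of movies in that genre; both are assumptions, and the data is made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Rating = Tuple[int, int, float]  # (user_id, movie_id, rating)


def genre_summary(
    movie_genres: Dict[int, str], ratings: Iterable[Rating]
) -> Dict[str, Tuple[int, Optional[float]]]:
    """Map each genre to (number of movies, average rating or None)."""
    movies_in = defaultdict(set)
    for movie, genres in movie_genres.items():
        for genre in genres.split("|"):
            if genre:
                movies_in[genre].add(movie)

    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for _user, movie, rating in ratings:
        for genre in movie_genres.get(movie, "").split("|"):
            if genre:
                sums[genre] += rating
                counts[genre] += 1

    return {
        genre: (len(movies), sums[genre] / counts[genre] if counts[genre] else None)
        for genre, movies in movies_in.items()
    }


# Toy data, for illustration only.
toy_genres = {1: "Drama", 2: "Drama|Comedy", 3: "Film-Noir", 4: "Documentary"}
toy_ratings = [(1, 1, 3.0), (2, 1, 4.0), (1, 2, 3.0), (1, 3, 5.0)]
summary = genre_summary(toy_genres, toy_ratings)
print(summary)
```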


System I: Recommending top movies by Genre


Scheme 1

Scheme 2


System II: Building a Recommender System Using Collaborative Filtering

Idea 1: User Based Collaborative Filtering (UBCF)


Will you normalize the rating matrix? If so, which normalization option do you use?

What's the nearest neighborhood size you use?

Which similarity metric do you use?

If you say prediction is based on a "weighted average", then explain what weights you use.

Will you still have missing values after running the algorithm? If so, how do you handle those missing values?
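The questions above correspond to the main design choices of a UBCF predictor. The sketch below makes one set of choices purely for illustration: mean-centering per user, cosine similarity over co-rated movies, a neighborhood of size `k`, similarity values as the weights of the weighted average, and `None` for predictions that cannot be computed. These are assumptions, not necessarily the settings used in this project, and all names and data are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Ratings = Dict[int, Dict[int, float]]  # user_id -> {movie_id: rating}


def center(user_ratings: Dict[int, float]) -> Tuple[Dict[int, float], float]:
    """Subtract the user's mean rating; return (centered ratings, mean)."""
    mean = sum(user_ratings.values()) / len(user_ratings)
    return {m: r - mean for m, r in user_ratings.items()}, mean


def cosine(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine similarity over shared keys; 0.0 when it is undefined."""
    shared = a.keys() & b.keys()
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_ubcf(ratings: Ratings, user: int, movie: int, k: int = 2) -> Optional[float]:
    """Predict a rating from the k most similar users who rated the movie."""
    if user not in ratings or not ratings[user]:
        return None
    target, target_mean = center(ratings[user])
    candidates = []
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        centered, _ = center(other_ratings)
        similarity = cosine(target, centered)
        if similarity > 0.0:  # keep only positively correlated neighbors
            candidates.append((similarity, centered[movie]))
    neighbors = sorted(candidates, reverse=True)[:k]
    if not neighbors:
        return None  # still missing; the caller needs a fallback
    offset = sum(s * d for s, d in neighbors) / sum(s for s, _ in neighbors)
    return target_mean + offset


# Toy data: user 2 agrees with user 1, user 3 disagrees.
toy = {
    1: {10: 5.0, 20: 3.0, 30: 1.0},
    2: {10: 5.0, 20: 3.0, 30: 1.0, 40: 5.0},
    3: {10: 1.0, 20: 3.0, 30: 5.0, 40: 1.0},
}
prediction = predict_ubcf(toy, user=1, movie=40)
print(prediction)
```

A prediction stays missing whenever no positively similar neighbor has rated the movie, so a deployed system needs some fallback for those cases, for example a popularity-based ranking.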

Checking performance of this algorithm

Predicting movie recommendations for new user

Above movie IDs will be recommended to the new user.


Idea 2: Item Based Collaborative Filtering (IBCF)


Will you normalize the rating matrix? If so, which normalization option do you use?

What's the nearest neighborhood size you use?

Which similarity metric do you use?

If you say prediction is based on a "weighted average", then explain what weights you use.

Will you still have missing values after running the algorithm? If so, how do you handle those missing values?
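For IBCF the same questions apply, but similarity is computed between movies rather than between users. The sketch below uses cosine similarity on raw (unnormalized) ratings over users who rated both movies, keeps the `k` most similar movies that the target user has rated, and weights their ratings by similarity. These choices are illustrative assumptions, not the project's actual configuration; the names and data are hypothetical.

```python
import math
from typing import Dict, Optional

Ratings = Dict[int, Dict[int, float]]  # user_id -> {movie_id: rating}


def item_vectors(ratings: Ratings) -> Dict[int, Dict[int, float]]:
    """Transpose user -> movie -> rating into movie -> user -> rating."""
    items: Dict[int, Dict[int, float]] = {}
    for user, user_ratings in ratings.items():
        for movie, rating in user_ratings.items():
            items.setdefault(movie, {})[user] = rating
    return items


def cosine(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine similarity over users who rated both movies; 0.0 if undefined."""
    shared = a.keys() & b.keys()
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in shared))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in shared))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_ibcf(ratings: Ratings, user: int, movie: int, k: int = 2) -> Optional[float]:
    """Predict from the user's own ratings of the k most similar movies."""
    items = item_vectors(ratings)
    if user not in ratings or movie not in items:
        return None
    scored = []
    for other_movie, rating in ratings[user].items():
        if other_movie == movie:
            continue
        similarity = cosine(items[movie], items[other_movie])
        if similarity > 0.0:
            scored.append((similarity, rating))
    top = sorted(scored, reverse=True)[:k]
    if not top:
        return None  # still missing; the caller needs a fallback
    return sum(s * r for s, r in top) / sum(s for s, _ in top)


# Toy data: movie 30 is rated like movie 10 and unlike movie 20.
toy = {
    1: {10: 5.0, 20: 1.0},
    2: {10: 5.0, 20: 1.0, 30: 5.0},
    3: {10: 1.0, 20: 5.0, 30: 1.0},
}
nearest_only = predict_ibcf(toy, user=1, movie=30, k=1)
two_neighbors = predict_ibcf(toy, user=1, movie=30, k=2)
print(nearest_only, two_neighbors)
```

With `k=1` only the most similar movie (10) contributes; with `k=2` the dissimilar movie 20 pulls the prediction down, which shows how the neighborhood size affects the result.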

Checking performance of this algorithm


Model Comparison

As the plots below show, UBCF has a lower RMSE than IBCF. Hence, this model was chosen for deployment.
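For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings on held-out data. A minimal sketch follows; the function name and the numbers are illustrative and are not results from this project.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equally long rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up values, for illustration only.
example = rmse([1.0, 2.0], [2.0, 4.0])
print(example)
```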


Deployment

Technologies Used

Backend: Python, Django framework, Django plotly dash module

Frontend: JavaScript, CSS, HTML

Server: AWS EC2 small instance

System I: http://18.191.179.113:8000/movie_recommender/genre

OR

http://ec2-18-191-179-113.us-east-2.compute.amazonaws.com:8000/movie_recommender/genre

System II: http://18.191.179.113:8000/movie_recommender/

OR

http://ec2-18-191-179-113.us-east-2.compute.amazonaws.com:8000/movie_recommender/

Note: Please allow some time for the page to load.


References

  1. Book Recommender: Collaborative Filtering, Shiny. Philipp Spachtholz. https://github.com/pspachtholz/BookRecommender
  2. Recommendation Systems. Google. https://developers.google.com/machine-learning/recommendation/
  3. STAT 542 - Campuswire posts on Project 4. UIUC. https://campuswire.com/c/G497EEF81